Code to clean the data file-by-file

Importing the necessary libraries

In [1]:
import pandas as pd
import csv
import string
import re
import nltk

nltk.download('stopwords')
nltk.download('names')
from nltk.corpus import stopwords
from nltk.corpus import names
from nltk import word_tokenize
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Aruna\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Aruna\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

%matplotlib inline
pd.set_option('display.max_colwidth', 150)

(A) Read the CSV File

In [3]:
df = pd.read_csv("C:\\Users\\Aruna\\Documents\\input\\Amazon Elastic Beanstalk.csv")

df['description'] = df['description'].apply(lambda x: " ".join(x for x in str(x).split())) # cast to str and collapse runs of whitespace
 
df.head(10)
Out[3]:
id label description
0 7471 Amazon Elastic Beanstalk http-err 302 in sqsd logs I keep getting this error in /var/log/aws-sqsd/default.log file http-err: 4d45561e-a135-4566-ab18-2bdd1f134ff3 (1) 302 -...
1 7471 Amazon Elastic Beanstalk Moved to the "AWS Elastic Beanstalk" forum, as this daemon is part of a worker environment.
2 7470 Amazon Elastic Beanstalk Remove default static file path of /static/ Under the software configuration there is a default static file path of "/static/" that maps to a "sta...
3 7469 Amazon Elastic Beanstalk Elastic Beanstalk not working with HTTPS I have deployed my Django app in AWS elastic beanstalk and successfully installed SSL. But HTTPS is not w...
4 7468 Amazon Elastic Beanstalk Error - "A problem occurred while loading your page: Rate exceeded" Earlier today, we were locked out of all of our Elasticbeanstalk applications ...
5 7468 Amazon Elastic Beanstalk I have been seeing this message the past few days while attempting to access any of our EB applications via the console. The CLI has been working ...
6 7468 Amazon Elastic Beanstalk Message below was from AWS Support. They reverted a change made to the Beanstalk console to resolve this problem. However, they are unable to expl...
7 7468 Amazon Elastic Beanstalk This was a bug in the AWS elasticbeanstalk console.
8 7467 Amazon Elastic Beanstalk Deployment fails after OS upgrade due to locked file - AWSSDK.dll? We upgraded our .NET framework of our application to 4.6.2 and as a result had ...
9 7466 Amazon Elastic Beanstalk Cannot turn off my experimental website due to account problem Hi there, I have a problem as I have an experimental website running in AWS althoug...
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22345 entries, 0 to 22344
Data columns (total 3 columns):
id             22345 non-null int64
label          22345 non-null object
description    22345 non-null object
dtypes: int64(1), object(2)
memory usage: 523.8+ KB

Check out one sample post:

In [5]:
p = 2000

df['description'][p]
Out[5]:
"Stack delete failure due to [AWSEBWorkerCronLeaderRegistry] Hi, a coworker created a Worker Environment on accident a week ago and he's been trying to terminate the instance without success since. Termination has been failing with error Stack deletion failed: The following resource(s) failed to delete: https://forums.aws.amazon.com/. I thought deleting the auto-created DynamoDB table would allow it to be terminated but no change in error message even after. Since there's no instances, there's no logs to look at and we're unable to change any configuration as it's not in a ready state. Pretty stuck on what should be done with this."

Top 30 words + frequency of each:

In [6]:
pd.Series(' '.join(df['description']).split()).value_counts()[:30]
Out[6]:
the            82276
to             62598
I              39415
a              31104
and            27535
is             25296
in             23449
for            18142
that           17451
on             17034
of             16841
you            16202
it             15901
this           15092
with           13102
have           12738
not            11391
be             10277
my             10238
your            9654
from            9372
an              9150
-               8622
environment     8590
can             8429
are             8249
instance        8117
but             8076
Beanstalk       7943
as              7553
dtype: int64
In [7]:
print("Total words before cleaning:", df['description'].apply(lambda x: len(x.split(' '))).sum())
Total words before cleaning: 2063115
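
The top-30 count above joins every post into one string, splits it on whitespace and counts the tokens. A minimal sketch of the same idea using only the standard library; the two sample posts are hypothetical stand-ins for df['description']:

```python
from collections import Counter

# Hypothetical sample posts, standing in for df['description'].
posts = [
    "the instance is not in the ready state",
    "I restarted the instance",
]

# Join every post into one string, split on whitespace, and count the tokens.
counts = Counter(" ".join(posts).split())

# most_common(n) plays the role of value_counts()[:n].
top_two = counts.most_common(2)
print(top_two)
```

Because nothing has been lowercased or filtered yet, common function words dominate the raw counts, which is what motivates the pre-processing that follows.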

(B) Text Pre-processing

In [8]:
STOPWORDS = stopwords.words('english')
my_stop_words = ["hi", "hello", "regards", "thank", "thanks", "regard", "best", "wishes", "hey", "amazon", "aws", "s3",
"elastic", "beanstalk", "rds", "ec2", "lambda", "cloudfront", "cloud", "front", "vpc", "sns", "me",
"january", "february", "march", "april", "may", "june", "july", "august", "september", "october", 
"november", "december", "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "sept", "oct", "nov",
"dec", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday", "mon", "tue",
"wed", "thu", "fri", "sat", "sun", "ain't", "aren't", "can't", "can't've", "'cause", "could've", "couldn't",
"couldn't've", "didn't", "doesn't", "don't", "hadn't", "hadn't've", "hasn't", "haven't", "he'd", "he'd've",
"he'll", "he'll've", "he's", "how'd", "how'd'y", "how'll", "how's", "i'd", "i'd've", "i'll", "i'll've", "i'm",
"i've", "isn't", "it'd", "it'd've", "it'll", "it'll've", "it's", "let's", "mayn't", "might've", "mightn't",
"mightn't've", "must've", "mustn't", "mustn't've", "needn't", "needn't've", "oughtn't", "oughtn't've", "shan't",
"sha'n't", "shan't've", "she'd", "she'd've", "she'll", "she'll've", "she's", "should've", "shouldn't", "shouldn't've",
"so've", "so's", "that'd", "that'd've", "that's", "there'd", "there'd've", "there's", "they'd", "they'd've", "they'll",
"they'll've", "they're", "they've", "to've", "wasn't", "we'd", "we'd've", "we'll", "we'll've", "we're", "we've",
"weren't", "what'll", "what'll've", "what're", "what's", "what've", "when's", "when've", "where'd", "where's",
"where've", "who'll", "who'll've", "who's", "who've", "why's", "why've", "will've", "won't", "won't've", "would've",
"wouldn't", "wouldn't've", "yall", "yalld", "yalldve", "yallre", "yallve", "youd", "youdve", "youll",
"youllve", "youre", "youve", "do", "did", "does", "had", "have", "has", "could", "can", "as", "is",
"shall", "should", "would", "will", "you", "me", "please", "know", "who", "we", "was", "were", "edited", "by", "pm"]

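name = names.words()
STOPWORDS.extend(my_stop_words)
STOPWORDS.extend(name)

# Raw strings avoid invalid-escape warnings for sequences such as \[ and \|.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,:;#+?]')
# Inside this class, ' - ' is parsed as a range from space to space, so hyphens
# are NOT kept: only digits, lowercase letters, spaces, '_' and '.' survive.
BAD_SYMBOLS_RE = re.compile('[^0-9a-z - _.]+')
REMOVE_HTML_RE = re.compile(r'<.*?>')
REMOVE_HTTP_RE = re.compile(r'http\S+')

# Strip the same symbols from the stop words so they match the cleaned text
# (e.g. "can't" becomes "cant"). Uppercase letters are stripped too, so the
# capitalized entries from names.words() lose their first letter here.
STOPWORDS = [BAD_SYMBOLS_RE.sub('', x) for x in STOPWORDS]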
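
One thing to watch in the cell above: the symbol pattern only allows lowercase letters, so a capitalized name loses its uppercase letter when the pattern is applied to it. The sketch below shows the effect on a hypothetical three-word list and one possible alternative (lowercasing first and storing the result in a set). It is an illustration under those assumptions, not what the notebook ran, and the outputs recorded further down come from the original cell.

```python
import re

BAD_SYMBOLS_RE = re.compile('[^0-9a-z - _.]+')  # same pattern as in the cell above

# Hypothetical mix of stop-word sources: a capitalized name and a contraction.
raw_stop_words = ["Aaron", "can't", "hello"]

# Applying the pattern directly deletes the uppercase "A" along with the apostrophe.
as_in_notebook = [BAD_SYMBOLS_RE.sub('', w) for w in raw_stop_words]

# Lowercasing first keeps the whole name; a set gives fast membership tests.
normalized = {BAD_SYMBOLS_RE.sub('', w.lower()) for w in raw_stop_words}

print(as_in_notebook)
print(sorted(normalized))
```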

Convert to lowercase

In [9]:
df['description'] = df['description'].str.lower()

df['description'][p]
Out[9]:
"stack delete failure due to [awsebworkercronleaderregistry] hi, a coworker created a worker environment on accident a week ago and he's been trying to terminate the instance without success since. termination has been failing with error stack deletion failed: the following resource(s) failed to delete: https://forums.aws.amazon.com/. i thought deleting the auto-created dynamodb table would allow it to be terminated but no change in error message even after. since there's no instances, there's no logs to look at and we're unable to change any configuration as it's not in a ready state. pretty stuck on what should be done with this."

Remove all HTML tags

In [10]:
df['description'] = df['description'].apply(lambda x: " ".join(REMOVE_HTML_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[10]:
"stack delete failure due to [awsebworkercronleaderregistry] hi, a coworker created a worker environment on accident a week ago and he's been trying to terminate the instance without success since. termination has been failing with error stack deletion failed: the following resource(s) failed to delete: https://forums.aws.amazon.com/. i thought deleting the auto-created dynamodb table would allow it to be terminated but no change in error message even after. since there's no instances, there's no logs to look at and we're unable to change any configuration as it's not in a ready state. pretty stuck on what should be done with this."

Remove URLs

In [11]:
df['description'] = df['description'].apply(lambda x: " ".join(REMOVE_HTTP_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[11]:
"stack delete failure due to [awsebworkercronleaderregistry] hi, a coworker created a worker environment on accident a week ago and he's been trying to terminate the instance without success since. termination has been failing with error stack deletion failed: the following resource(s) failed to delete:   i thought deleting the auto-created dynamodb table would allow it to be terminated but no change in error message even after. since there's no instances, there's no logs to look at and we're unable to change any configuration as it's not in a ready state. pretty stuck on what should be done with this."

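The two cells above apply each pattern to one whitespace-separated token at a time. That works for URLs, which contain no spaces, but a tag that carries an attribute is split across tokens and never matches. A small sketch on a hypothetical post, comparing the per-token approach with applying the patterns to the whole string (an alternative, not what the notebook ran):

```python
import re

REMOVE_HTML_RE = re.compile(r'<.*?>')
REMOVE_HTTP_RE = re.compile(r'http\S+')

# Hypothetical post containing a tag with an attribute and a bare URL.
text = 'see <a href="x">the docs</a> at https://example.com/page now'

# Token by token, as in the cells above: '<a' and 'href="x">the' are separate
# tokens, so the opening tag is never matched as a whole.
per_token = " ".join(REMOVE_HTML_RE.sub(' ', tok) for tok in text.split())

# Whole string: the pattern sees the complete tag, then the URL is removed.
whole = REMOVE_HTTP_RE.sub(' ', REMOVE_HTML_RE.sub(' ', text))
whole = " ".join(whole.split())  # collapse the spaces left behind

print(per_token)
print(whole)
```
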
Replace certain characters with a space (slashes, parentheses, brackets, commas, colons, etc.)

In [12]:
df['description'] = df['description'].apply(lambda x: " ".join(REPLACE_BY_SPACE_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[12]:
"stack delete failure due to  awsebworkercronleaderregistry  hi  a coworker created a worker environment on accident a week ago and he's been trying to terminate the instance without success since. termination has been failing with error stack deletion failed  the following resource s  failed to delete  i thought deleting the auto-created dynamodb table would allow it to be terminated but no change in error message even after. since there's no instances  there's no logs to look at and we're unable to change any configuration as it's not in a ready state. pretty stuck on what should be done with this."

Remove any unwanted symbols (like $, %, apostrophes and hyphens)

In [13]:
df['description'] = df['description'].apply(lambda x: " ".join(BAD_SYMBOLS_RE.sub('', x) for x in str(x).split()))

df['description'][p]
Out[13]:
'stack delete failure due to awsebworkercronleaderregistry hi a coworker created a worker environment on accident a week ago and hes been trying to terminate the instance without success since. termination has been failing with error stack deletion failed the following resource s failed to delete i thought deleting the autocreated dynamodb table would allow it to be terminated but no change in error message even after. since theres no instances theres no logs to look at and were unable to change any configuration as its not in a ready state. pretty stuck on what should be done with this.'
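
The output above shows "auto-created" turning into "autocreated". That follows from how the character class is parsed: ` - ` between two spaces is read as a range from space to space, so the hyphen itself is not in the allowed set. A short sketch on a hypothetical sample string; the second pattern, with an escaped hyphen, is an assumption about what one might write to keep hyphens and is not part of the notebook:

```python
import re

# Pattern from the notebook: hyphens are not in the allowed set.
BAD_SYMBOLS_RE = re.compile('[^0-9a-z - _.]+')

# Hypothetical variant that keeps hyphens by escaping the dash.
KEEP_HYPHEN_RE = re.compile(r'[^0-9a-z \-_.]+')

sample = "auto-created he's $5 file_name v1.2"

cleaned = BAD_SYMBOLS_RE.sub('', sample)
cleaned_keep_hyphen = KEEP_HYPHEN_RE.sub('', sample)

print(cleaned)
print(cleaned_keep_hyphen)
```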

Strip leading and trailing periods, hyphens and underscores from each word

In [14]:
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('.') for x in x.split()))
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('-') for x in x.split()))
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('_') for x in x.split()))
df['description'][p]
Out[14]:
'stack delete failure due to awsebworkercronleaderregistry hi a coworker created a worker environment on accident a week ago and hes been trying to terminate the instance without success since termination has been failing with error stack deletion failed the following resource s failed to delete i thought deleting the autocreated dynamodb table would allow it to be terminated but no change in error message even after since theres no instances theres no logs to look at and were unable to change any configuration as its not in a ready state pretty stuck on what should be done with this'

Remove any tokens made up only of digits

In [15]:
df['description'] = df['description'].apply(lambda x: " ".join(x for x in x.split() if not x.isdigit()))

df['description'][p]
Out[15]:
'stack delete failure due to awsebworkercronleaderregistry hi a coworker created a worker environment on accident a week ago and hes been trying to terminate the instance without success since termination has been failing with error stack deletion failed the following resource s failed to delete i thought deleting the autocreated dynamodb table would allow it to be terminated but no change in error message even after since theres no instances theres no logs to look at and were unable to change any configuration as its not in a ready state pretty stuck on what should be done with this'
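
Note that str.isdigit() is true only for tokens made up entirely of digits, so decimals and version strings such as 0.006 or 4.6.2 pass through this step. A sketch on a hypothetical token list; the stricter regex filter is an assumption about one way to also drop those, not something the notebook does:

```python
import re

# Hypothetical tokens after the earlier cleaning steps.
tokens = "upgraded to 4.6.2 on 2 ec2 hosts in 2017".split()

# As in the cell above: only pure-digit tokens are dropped.
kept_isdigit = [t for t in tokens if not t.isdigit()]

# Stricter alternative: drop any token made up only of digits and periods.
NUMERIC_RE = re.compile(r'[0-9.]+')
kept_strict = [t for t in tokens if not NUMERIC_RE.fullmatch(t)]

print(kept_isdigit)
print(kept_strict)
```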

Remove the stop words and single-character tokens

In [16]:
df['description'] = df['description'].apply(lambda x: " ".join(x for x in x.split() if x not in STOPWORDS
                                                               and len(x) > 1))

df['description'][p]
Out[16]:
'stack delete failure due awsebworkercronleaderregistry coworker created worker environment accident week trying terminate instance without success since termination failing error stack deletion failed following resource failed delete thought deleting autocreated dynamodb table allow terminated change error message even since instances logs look unable change configuration ready state pretty stuck done'
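
The pre-processing cells can also be folded into a single function, which makes it easier to apply the same steps to each input file. The sketch below mirrors the order of the steps in section (B); `per_token`, `clean_text` and the tiny `STOP` set are hypothetical names introduced here for illustration, and the real stop-word list is much larger.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,:;#+?]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z - _.]+')
REMOVE_HTML_RE = re.compile(r'<.*?>')
REMOVE_HTTP_RE = re.compile(r'http\S+')

# Hypothetical, tiny stop-word set; the notebook builds a much larger list.
STOP = {"hi", "a", "the", "to", "is", "my"}


def per_token(pattern, repl, text):
    """Apply a regex substitution to each whitespace-separated token."""
    return " ".join(pattern.sub(repl, tok) for tok in text.split())


def clean_text(text, stop_words=STOP):
    """Run the cleaning steps from section (B) on one post."""
    text = " ".join(str(text).lower().split())        # lowercase, collapse spaces
    text = per_token(REMOVE_HTML_RE, ' ', text)       # HTML tags
    text = per_token(REMOVE_HTTP_RE, ' ', text)       # URLs
    text = per_token(REPLACE_BY_SPACE_RE, ' ', text)  # brackets, commas, colons
    text = per_token(BAD_SYMBOLS_RE, '', text)        # remaining symbols
    for ch in '.-_':                                  # leading/trailing marks
        text = " ".join(tok.strip(ch) for tok in text.split())
    tokens = [t for t in text.split() if not t.isdigit()]
    return " ".join(t for t in tokens if t not in stop_words and len(t) > 1)


print(clean_text("Hi, my instance (i-123) is failing: see https://example.com/x."))
```

With a function like this, the column could be cleaned in one pass with df['description'].apply(clean_text), assuming the same DataFrame layout as above.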

Results after cleaning data:

In [17]:
df.head()
Out[17]:
id label description
0 7471 Amazon Elastic Beanstalk sqsd logs keep getting error var log awssqsd default.log file 4d45561ea1354566ab182bdd1f134ff3 0.006 numbers letters changing time mean
1 7471 Amazon Elastic Beanstalk moved forum daemon part worker environment
2 7470 Amazon Elastic Beanstalk remove default static file path static software configuration default static file path static maps static directory want remove special way handli...
3 7469 Amazon Elastic Beanstalk working deployed django app successfully installed ssl working showing refused connect chrome browser
4 7468 Amazon Elastic Beanstalk error problem occurred loading page rate exceeded earlier today locked elasticbeanstalk applications console click one get problem occurred loadin...

Top 30 words + frequency of each:

In [18]:
pd.Series(' '.join(df['description']).split()).value_counts()[:30]
Out[18]:
environment         14173
instance            12138
application          9976
error                9008
file                 8290
info                 7106
new                  6879
using                6788
command              6375
running              6329
elasticbeanstalk     6081
app                  5851
get                  5330
instances            5240
version              4990
failed               4764
use                  4708
deploy               4512
lib                  4476
issue                4435
opt                  4365
like                 4030
debug                4026
see                  3998
one                  3886
log                  3864
configuration        3819
problem              3759
root                 3621
server               3620
dtype: int64
In [19]:
print("Total words after cleaning:", df['description'].apply(lambda x: len(x.split(' '))).sum())
Total words after cleaning: 1094639

(C) Write to CleanText.csv

In [20]:
with open('C:\\Users\\Aruna\\Documents\\ACMS-IID\\input\\CleanText.csv', 'a', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    # writer.writerow(['id', 'label', 'description'])
    for i in range(0, len(df['description'])):
        if len(df['description'][i]) > 1:
            writer.writerow([df['id'][i], df['label'][i], df['description'][i]])
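
Since the file is opened in append mode and the header line is commented out, the header has to be written by hand the first time. One way to handle this, sketched below with the standard library, is to write the header only when the file is missing or empty. `append_rows` is a hypothetical helper, not part of the notebook, and it assumes each row is an (id, label, description) tuple.

```python
import csv
import os


def append_rows(path, rows, header=('id', 'label', 'description')):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile)
        if needs_header:
            writer.writerow(header)
        for row in rows:
            # Skip posts whose cleaned description is empty or a single character.
            if len(row[2]) > 1:
                writer.writerow(row)
```

Calling it once per cleaned input file, for example with rows taken from zip(df['id'], df['label'], df['description']), would then build up a single CleanText.csv with one header line.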

(D) Generate the word cloud

In [21]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 20, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[21]:
(-0.5, 399.5, 199.5, -0.5)
In [22]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 50, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[22]:
(-0.5, 399.5, 199.5, -0.5)
In [23]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 100, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[23]:
(-0.5, 399.5, 199.5, -0.5)
In [24]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 500, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[24]:
(-0.5, 399.5, 199.5, -0.5)
In [25]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 1000, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[25]:
(-0.5, 399.5, 199.5, -0.5)